Introduction

This project helps an online greeting card company understand the life-time value of its customers. The company wants to improve revenue by identifying loyal customers and finding ways to retain them, and it also wants to distinguish inactive users from active ones. Achieving this goal requires a customer segmentation scheme. This project therefore addresses the following problems:

a) A model to predict whether a customer will cancel their subscription in the near future
b) A model to estimate the life-time value of a customer
c) A customer segmentation scheme to distinguish inactive from active users

The client provided 403,835 daily usage records for 10,000 customers, spanning January 1st, 2011 to December 31st, 2014. The dataset captures several statistics of each customer's behavior on the website:

Field | Description
id | A unique user identifier
status | Subscription status: '0' = new, '1' = open, '2' = cancellation
gender | User gender: 'M' = male, 'F' = female
date | Date on which user 'id' logged into the site
pages | Number of pages visited by user 'id' on date 'date'
onsite | Number of minutes spent on site by user 'id' on date 'date'
entered | Flag indicating whether or not user 'id' entered the send-order path on date 'date'
completed | Flag indicating whether the user completed the order (sent an eCard)
holiday | Flag indicating whether at least one completed order included a holiday-themed card

Methodology and Findings

Explore Data
## 'data.frame':    403835 obs. of  9 variables:
##  $ id       : int  1 1 1 1 1 1 2 2 2 2 ...
##  $ status   : int  0 1 1 1 1 2 0 1 1 1 ...
##  $ gender   : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 1 1 1 ...
##  $ date     : Factor w/ 1461 levels "2011-01-01","2011-01-02",..: 1400 1401 1402 1445 1449 1453 981 982 983 985 ...
##  $ pages    : int  7 6 6 1 1 0 7 9 7 8 ...
##  $ onsite   : int  3 8 20 1 1 0 23 30 19 3 ...
##  $ entered  : int  1 0 1 0 0 0 1 1 1 1 ...
##  $ completed: int  1 0 0 0 0 0 1 1 1 0 ...
##  $ holiday  : int  0 0 0 0 0 0 0 0 0 0 ...
##        id            status       gender             date       
##  Min.   :    1   Min.   :0.0000   F:358716   2013-12-24:  1132  
##  1st Qu.: 2537   1st Qu.:1.0000   M: 45119   2012-12-24:   954  
##  Median : 5025   Median :1.0000              2013-12-17:   821  
##  Mean   : 5017   Mean   :0.9909              2013-12-18:   808  
##  3rd Qu.: 7495   3rd Qu.:1.0000              2013-10-27:   801  
##  Max.   :10000   Max.   :2.0000              2013-12-23:   789  
##                                              (Other)   :398530  
##      pages            onsite           entered         completed     
##  Min.   : 0.000   Min.   :  0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000   1st Qu.:  2.000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median : 5.000   Median :  5.000   Median :1.0000   Median :1.0000  
##  Mean   : 5.018   Mean   :  8.831   Mean   :0.7821   Mean   :0.5627  
##  3rd Qu.: 7.000   3rd Qu.: 11.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :10.000   Max.   :220.000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##     holiday      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2277  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
## 

Initial exploration of the data shows:

  1. No missing values (NA) are present in the data

  2. No discrepancies in the collection of observations

  3. Female customers far outnumber male customers

The data requires no further cleaning.

Feature Engineering

Since the usage statistics provided by the client are at the daily level, we first aggregate the data to the customer level and create additional features that may help in exploring customers' life-time value. When calculating duration-related features, we assume the last active date for customers who have not cancelled their accounts is 12-31-2014, the last day of the dataset's effective date range.
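
The daily-to-customer aggregation can be sketched as follows. The original analysis was done in R; this pandas snippet only illustrates the idea on a tiny made-up sample, using feature names from the table below.

```python
# Illustrative sketch of aggregating daily usage records to the customer level.
# Data values are invented; feature names follow this report's conventions.
import pandas as pd

daily = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2],
    "pages":     [7, 6, 1, 7, 9],
    "onsite":    [3, 8, 20, 23, 30],
    "completed": [1, 0, 0, 1, 1],
})

# One row per customer: counts, means, and extremes over their login history.
customer = daily.groupby("id").agg(
    totalLoginNum=("pages", "size"),
    avgPages=("pages", "mean"),
    maxPages=("pages", "max"),
    minPages=("pages", "min"),
    avgOnsite=("onsite", "mean"),
    sumCompleted=("completed", "sum"),
).reset_index()
```

The same pattern extends to the standard deviation, skewness, and date-difference features listed below.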

Each feature is described below:

Feature | Description
gender | User gender: 'M' = male, 'F' = female
totalLoginNum | Total number of logins over the customer's tenure
avgPages | Average number of pages visited by the customer over their tenure
maxPages | Maximum number of pages visited in a single login over the customer's tenure
minPages | Minimum number of pages visited in a single login over the customer's tenure
sdPages | Standard deviation of the number of pages visited over the customer's tenure
skewPages | Skewness of the number of pages visited over the customer's tenure
avgOnsite | Average number of minutes spent on site over the customer's tenure
maxOnsite | Maximum number of minutes spent on site in a single login over the customer's tenure
minOnsite | Minimum number of minutes spent on site in a single login over the customer's tenure
sdOnsite | Standard deviation of the number of minutes spent on site over the customer's tenure
skewOnsite | Skewness of the number of minutes spent on site over the customer's tenure
avgOnsitePageTime | Average number of minutes the customer spends on one page
maxOnsitePageTime | Maximum number of minutes the customer spends on one page
minOnsitePageTime | Minimum number of minutes the customer spends on one page
sdOnsitePageTime | Standard deviation of the number of minutes the customer spends on one page
skewOnsitePageTime | Skewness of the number of minutes the customer spends on one page
avgEntered | Average number of entered orders of the customer
sumEntered | Total number of entered orders of the customer
avgCompleted | Average number of completed orders of the customer
sumCompleted | Total number of completed orders of the customer
cmplOverEtr | Ratio of total completed orders to total entered orders
YesHoliday | Total number of completed orders that included a holiday-themed card
NoHoliday | Total number of completed orders that did not include a holiday-themed card
conversion | Average conversion rate from page visits to entering the order path
firstMonthLoginNum | Number of logins during the first month of the customer's subscription
lastMonthLoginNum | Number of logins during the last month of the customer's subscription; for customers who have not cancelled, the last month is December 2014
oneMonthLoginRatio | Ratio of firstMonthLoginNum to lastMonthLoginNum
Q1 | Number of logins during the first quarter (January to March) of each year of the subscription
Q2 | Number of logins during the second quarter (April to June) of each year of the subscription
Q3 | Number of logins during the third quarter (July to September) of each year of the subscription
Q4 | Number of logins during the last quarter (October to December) of each year of the subscription
avgDateDiff | Average time difference between adjacent logins
maxDateDiff | Maximum time difference between adjacent logins
minDateDiff | Minimum time difference between adjacent logins
sdDateDiff | Standard deviation of the time differences between adjacent logins

Data Mining Methods and Findings

With these features, we next analyze the following questions:

  • Will a customer unsubscribe from the service in the near future?
  • What is the life-time value of each customer?
  • How can we segment customers so as to identify sleeping customers?

In the following sections, we walk through the features and the predictive/descriptive methods fitted for each problem, compare the performance of the models we tried, and summarize our findings. Before fitting each model, we split the data into a training set (80% of all customer data) and a test set (the remaining 20%), train the models on the training set, and compare them by their performance on the test set.
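
A minimal sketch of the 80/20 customer-level split described above (the seed is illustrative, not from the report):

```python
# Randomly partition customer indices into 80% train / 20% test.
import numpy as np

rng = np.random.default_rng(42)   # illustrative seed for reproducibility
n_customers = 10_000
idx = rng.permutation(n_customers)

n_train = int(0.8 * n_customers)
train_idx, test_idx = idx[:n_train], idx[n_train:]
```

Splitting at the customer level (rather than the daily-record level) keeps all of a customer's history on one side of the split.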

Problem 1: Will a customer unsubscribe from the service in the near future?

This is a classic classification problem, so we consider four models: Logistic Regression with lasso, Random Forests, Naive Bayes, and K-Nearest Neighbors (KNN).

Each of these models has advantages and disadvantages, which we briefly review before discussing model selection:

Random Forests: A highly accurate and flexible learning algorithm that also provides variable-importance rankings. The disadvantage is that the resulting model is hard to interpret.

Logistic Regression with lasso: Lasso automatically selects the most informative variables, but it has a hard time detecting interaction terms, and it assumes a linear model.

Naive Bayes: Scales well to problems with a large number of predictors. The downside is the assumption that all inputs are independent within each class, which can be a problem when variables are collinear.

KNN: Highly flexible, but hard to interpret.

To select a model, we used K-fold cross-validation to estimate overall accuracy on the training set. We then weighed each model's interpretability and flexibility against the business goal of accurately identifying customers who will cancel their subscription.
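
The model-comparison loop can be sketched as follows. The report's analysis is in R; this scikit-learn snippet on synthetic data mirrors the procedure, not the exact code.

```python
# Compare classifiers by mean K-fold cross-validated accuracy on training data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the customer-level training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
cv_accuracy = {name: cross_val_score(m, X, y, cv=5).mean()
               for name, m in models.items()}
```

The same loop extends to the random forest and Naive Bayes candidates.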

Model 1: Logistic Regression with lasso

We fit a lasso-regularized logistic regression on the training data, choosing the \(\lambda\) value by cross-validation; the CV error plot is as below:
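
The 1-SE rule used throughout this report picks the most regularized model (largest \(\lambda\)) whose CV error is within one standard error of the minimum. A toy sketch with made-up CV results:

```python
# 1-SE rule on a made-up lasso CV path (values are illustrative only).
import numpy as np

lambdas = np.array([1.0, 0.3, 0.1, 0.03, 0.01])   # descending regularization
cv_err  = np.array([0.40, 0.25, 0.20, 0.19, 0.21])
cv_se   = np.array([0.02, 0.02, 0.02, 0.02, 0.02])

i_min = int(np.argmin(cv_err))
threshold = cv_err[i_min] + cv_se[i_min]
# Largest lambda whose CV error is still within one SE of the best error.
lambda_1se = max(lambdas[cv_err <= threshold])
```

Preferring the larger \(\lambda\) trades a little CV error for a sparser, more robust model.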

From the plot above, we choose the optimal \(\lambda\) = 0.0036 using the 1-SE rule; the optimal model selects 24 variables. The variables and their coefficients are in the table below:

Variable  Coefficient
(Intercept) 5.593
genderM -0.946
avgPages -5.984
maxPages 1.377
minPages 1.322
sdPages 3.551
skewPages -7.423
maxOnsite 0.001
avgOnsitePageTime -0.006
minOnsitePageTime -1.535
skewOnsitePageTime 0.074
avgEntered -1.373
avgCompleted -2.575
YesHoliday 0.020
conversion 5.567
firstMonthLoginNum 0.172
lastMonthLoginNum 0.498
firstMonthLoginRatio 3.413
Q1 0.021
Q2 0.011
Q3 0.002
Q4 -0.003
avgDateDiff 0.044
maxDateDiff 0.007
minDateDiff -0.832

Then we test the model on the test data and obtain the confusion matrix below:

##           Observation
## Prediction    0    1
##          0  599   92
##          1  129 1178

The misclassification rate on test data is 11.06%.

##                result
## accuracy    0.8893894
## sensitivity 0.9275591
## specificity 0.8228022
## ppv         0.9013007
## npv         0.8668596
## precision   0.9013007
## recall      0.9275591
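
The reported metrics can be recomputed directly from the confusion matrix above (rows = prediction, columns = observation, class 1 treated as positive):

```python
# Recompute classification metrics from the Model 1 confusion matrix.
tn, fn = 599, 92     # predicted 0: true negatives, false negatives
fp, tp = 129, 1178   # predicted 1: false positives, true positives

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # a.k.a. recall
specificity = tn / (tn + fp)
ppv         = tp / (tp + fp)   # a.k.a. precision
npv         = tn / (tn + fn)
```
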

Model 2: Random Forests

Random forests improve on bagged trees via a small tweak that decorrelates the trees: we build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. We fit random forests to the training data and obtain the variable importances below:
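
The construction just described, bootstrapped samples plus a random subset of m ≈ √p predictors at each split, can be sketched in scikit-learn (the report's analysis used R; data here is synthetic):

```python
# Random forest with per-split feature subsampling (max_features="sqrt"),
# which is the decorrelation tweak over plain bagging.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=16, random_state=0)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
importances = rf.feature_importances_   # the variable-importance ranking
```
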

Then we test the random forests on the test set and obtain the confusion matrix below:

##           Observation
## Prediction    0    1
##          0  539   69
##          1  189 1201

The misclassification rate on test data is 12.91%.

##                result
## accuracy    0.8708709
## sensitivity 0.9456693
## specificity 0.7403846
## ppv         0.8640288
## npv         0.8865132
## precision   0.8640288
## recall      0.9456693

Model 3: Naive Bayes

Naive Bayes classifiers are a family of simple probabilistic classifiers that apply Bayes' theorem with strong independence assumptions between the features. Although our features may not be fully independent given the class label, Naive Bayes often performs well on large feature spaces, so we try it here. We fit Naive Bayes to the training data, then test it on the test set and obtain the confusion matrix below:

##           Observation
## Prediction    0    1
##          0  510  182
##          1  218 1088

The misclassification rate on test data is 20.02%.

##                result
## accuracy    0.7997998
## sensitivity 0.8566929
## specificity 0.7005495
## ppv         0.8330781
## npv         0.7369942
## precision   0.8330781
## recall      0.8566929
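
A minimal sketch of a Naive Bayes fit like the one above (scikit-learn's Gaussian variant on synthetic data; the report's analysis used R):

```python
# Gaussian Naive Bayes: fast because each feature is modeled independently
# within each class.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

nb = GaussianNB().fit(X_tr, y_tr)
test_accuracy = nb.score(X_te, y_te)
```
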

Model 4: K-Nearest Neighbors

As noted above, KNN is highly flexible but hard to interpret. We fit KNN on the training data, then test it on the test set and obtain the confusion matrix below:

##           Observation
## Prediction    0    1
##          0  327  258
##          1  401 1012

The misclassification rate on test data is 32.98%.

##                result
## accuracy    0.6701702
## sensitivity 0.7968504
## specificity 0.4491758
## ppv         0.7162067
## npv         0.5589744
## precision   0.7162067
## recall      0.7968504

By comparing the test-set performance of these four models, we find that Logistic Regression with lasso (\(\lambda\) chosen by the 1-SE rule) performs best. The table below compares the misclassification rates of the four models; Logistic Regression with lasso achieves the lowest.

Model  Misclassification Rate
Logistic Regression with lasso 11.06%
Random Forests 12.91%
Naive Bayes 20.02%
KNN 32.98%

From the ROC curves of Logistic Regression with lasso (black), KNN (steel blue), and Random Forests (green), we can also see that Logistic Regression with lasso performs best at predicting whether a customer will unsubscribe from the service in the near future. Compared with the prevalence of cancelled customers in our training set, 63.14%, the 88.94% accuracy achieved by Logistic Regression with lasso is outstanding; we can apply this model to future users and develop strategies to re-engage those predicted to cancel.
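
An ROC comparison like the one described above can be computed as follows (scikit-learn sketch on a tiny illustrative example, not the report's data):

```python
# ROC curve and AUC from true labels and predicted scores.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

AUC summarizes each curve in a single number, which makes multi-model comparison straightforward.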

Problem 2: What is the life-time value of each customer?

In Problem 2 we estimate each customer's life-time value, meaning the total revenue earned by the company over the course of its relationship with the customer. Since the output is a numerical value, this is a regression task. We analyze two groups: all customers, and only the customers who have already cancelled their accounts. We try three models: Linear Regression with lasso, Regression Tree, and Random Forests. We use 37 features this time: the 36 features listed in Feature Engineering plus whether the customer has cancelled their account. The response is 'ltv', the customer's realized life-time value; for customers who have not cancelled, we compute ltv through 12-31-2014, the end date of the dataset. When comparing models we use Mean Absolute Error as our metric, since it directly tells us how far our predictions are from the true values.
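
Mean Absolute Error is simply the average of |prediction − truth|; a quick sketch on toy values:

```python
# MAE: average absolute deviation of predictions from true values.
import numpy as np

def mean_absolute_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(y_true - y_pred)))

mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 19.0, 26.0])
```

Unlike squared-error metrics, MAE is in the same units as ltv, so an MAE of 1.13 means predictions are off by about 1.13 on average.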

All Customers
Model 1: Linear Regression with lasso

Just like Model 1 in Problem 1, lasso automatically performs feature selection when fitting a regularized linear regression to predict ltv. We choose the \(\lambda\) value using cross-validation; the CV error plot is below:

From the plot above, we choose the optimal \(\lambda\) = 0.0115606 using the 1-SE rule; the optimal model selects 29 variables. The variables and their coefficients are in the table below:

Variable  Coefficient
(Intercept) -3.392
cancelled1 -0.266
genderM 1.560
totalLoginNum 0.202
maxPages 0.651
minPages -0.107
sdPages -1.248
skewPages 0.520
avgOnsite 0.074
maxOnsite 0.003
minOnsite 0.370
minOnsitePageTime -2.135
sdOnsitePageTime -0.542
skewOnsitePageTime 0.359
avgEntered -1.129
sumEntered -0.002
avgCompleted 2.483
YesHoliday 0.510
NoHoliday -0.237
conversion 1.066
firstMonthLoginNum -0.027
lastMonthLoginNum -0.147
firstMonthLoginRatio -2.675
Q1 0.118
Q2 0.065
Q3 0.074
avgDateDiff 0.530
maxDateDiff 0.080
minDateDiff -0.911
sdDateDiff -0.343

Then we test the model on the test data; the Mean Absolute Error is 2.43.

Model 2: Regression Tree

Trees are highly interpretable, so we also fit a regression tree on the training dataset for this task. We choose the optimal tree size using cross-validation; the CV error plot is below:

From the plot above, the optimal tree size is 8 by the 1-SE rule. We prune the tree using the complexity parameter chosen by the 1-SE rule; the pruned tree is shown below:

Then we test the model on the test data; the Mean Absolute Error is 3.82.
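
Pruning with a complexity parameter can be sketched with scikit-learn's cost-complexity pruning (the report uses R; the idea is the same: a larger complexity parameter yields a smaller tree). Data here is synthetic.

```python
# Cost-complexity pruning: ccp_alpha > 0 removes splits whose impurity
# reduction does not justify the added complexity.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)

full   = DecisionTreeRegressor(random_state=0).fit(X, y)
pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=50.0).fit(X, y)
```
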

Model 3: Random Forests

We fit Random Forests on training data for this regression task, and get the importance of different variables as below:

Then we test the model on the test data; the Mean Absolute Error is 1.13.

By comparing the test-set performance of these three models, we find that Random Forests performs best. The table below compares the Mean Absolute Error of the three models; Random Forests achieves the lowest.

Model  Mean Absolute Error
Linear Regression with lasso 2.43
Regression Tree 3.82
Random Forests 1.13

Therefore, Random Forests performs best at predicting the life-time value of all customers.

Customers Who Have Already Cancelled Their Accounts

There are 6,314 customers who have already cancelled their accounts. We use the same three models as for all customers.

Model 1: Linear Regression with lasso

We choose the \(\lambda\) value for lasso using cross-validation, and the CV error plot is as below:

From the plot above, we choose the optimal \(\lambda\) = 0.0125846 using the 1-SE rule; the optimal model selects 30 variables. The variables and their coefficients are in the table below:

Variable  Coefficient
(Intercept) -4.582
genderM 1.278
totalLoginNum 0.216
maxPages 0.637
minPages -0.233
sdPages -1.165
skewPages 0.620
avgOnsite 0.093
maxOnsite 0.005
minOnsite 0.542
avgOnsitePageTime 0.087
maxOnsitePageTime 0.018
minOnsitePageTime -3.200
sdOnsitePageTime -0.879
skewOnsitePageTime 0.367
sumEntered -0.016
avgCompleted 2.140
cmplOverEtr -0.063
YesHoliday 0.534
NoHoliday -0.241
conversion 0.270
firstMonthLoginNum -0.006
lastMonthLoginNum -0.059
firstMonthLoginRatio -2.088
Q1 0.089
Q2 0.075
Q3 0.082
avgDateDiff 0.505
maxDateDiff 0.073
minDateDiff -0.808
sdDateDiff -0.306

Then we test the model on the test data; the Mean Absolute Error is 2.04.

Model 2: Regression Tree

We also fit a regression tree on the training data for customers who have cancelled their accounts. We choose the optimal tree size using cross-validation; the CV error plot is below:

From the plot above, the optimal tree size is 9 by the 1-SE rule. We prune the tree using the complexity parameter chosen by the 1-SE rule; the pruned tree is shown below:

Then we test the model on the test data; the Mean Absolute Error is 3.61.

Model 3: Random Forests

We fit Random Forests on the training data for customers who have cancelled their accounts, and obtain the variable importances below:

Then we test the model on the test data; the Mean Absolute Error is 0.94.

By comparing the test-set performance of these three models, we again find that Random Forests performs best. The table below compares the Mean Absolute Error of the three models; Random Forests achieves the lowest.

Model  Mean Absolute Error
Linear Regression with lasso 2.04
Regression Tree 3.61
Random Forests 0.94

Therefore, Random Forests also performs best at predicting life-time value for customers who have already cancelled their accounts. Note that the Mean Absolute Error is further reduced for this group compared with all customers: customers who have not cancelled but registered late can have a very low life-time value under our calculation, so excluding them may make the prediction task easier and lower the error. Random Forests gives impressively low error on this task, which means our client could use this approach to predict the life-time value of future users and forecast revenue more accurately.

Problem 3: How to segment customers so that we can identify sleeping customers?

In the third problem, we develop a customer segmentation scheme to help our client identify sleeping customers: those who are no longer active but have not yet cancelled their account. This is an unsupervised learning task because there are no labels, so clustering is a natural method to shed light on the problem [Reference 7].

Before clustering the customers, we first explore the data. For customers who have not cancelled their accounts, we plot the number of days between their last login date and 12-31-2014, the end date of the dataset (used as the last day of their relationship with the company). Based on this plot, 15 days appears to be a reasonable cutoff: if a customer has not logged in for more than 15 days, we identify him/her as a sleeping customer.
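
The sleeping-customer rule can be sketched as follows (pandas, with invented login dates; the 15-day cutoff is the one chosen above):

```python
# Flag customers whose last login is more than 15 days before the end of
# the data range, 2014-12-31.
import pandas as pd

end_date = pd.Timestamp("2014-12-31")
last_login = pd.DataFrame({
    "id": [1, 2, 3],
    "lastLogin": pd.to_datetime(["2014-12-30", "2014-11-01", "2014-12-20"]),
})

last_login["daysSinceLogin"] = (end_date - last_login["lastLogin"]).dt.days
last_login["sleeping"] = last_login["daysSinceLogin"] > 15
```
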

We have 3681 customers who have not cancelled their accounts, among whom 841 are identified as sleeping customers.

To determine the features to use in clustering, we first fit Random Forests to the dataset as a classification task (sleeping vs. not sleeping) and use its variable-importance ranking to identify the features most predictive of whether a customer is sleeping. The importances are shown below:

From the importance ranking above, we choose oneMonthLoginRatio, lastMonthLoginNum, sdDateDiff, avgDateDiff, maxDateDiff, and ltv (the customer life-time value calculated in Problem 2) as highly relevant features for clustering. We also run a feature clustering to see whether any other features are highly correlated with the number of days each customer has been sleeping.

In the dendrogram above, we draw a horizontal red line at correlation = 0.6 and find that customers' sleeping days are not strongly correlated with other features. Finally, we cluster the customers on lastMonthLoginNum, oneMonthLoginRatio, avgDateDiff, sdDateDiff, maxDateDiff, ltv, and sleepDays using Hierarchical Clustering, scaling all features beforehand. Inspired by Tal Galili's hierarchical cluster analysis of the Iris dataset [Reference 2], we produce the heatmap below:

From the heatmap, we can see that customers with large avgDateDiff, sdDateDiff, maxDateDiff, and sleeping days, but low lastMonthLoginNum and oneMonthLoginRatio, fall into the cluster marked in blue. This aligns with our understanding of customers: those with low activity, indicated by large gaps between logins and few last-month logins relative to their first month, are highly likely to be sleeping. Our feature choices are therefore genuinely associated with a customer's sleeping status.
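
The scale-then-cluster pipeline described above can be sketched with scipy (the original heatmap was produced in R; feature values here are invented to show two clearly separated groups):

```python
# Standardize features, then run Ward hierarchical clustering and cut the
# dendrogram into two clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import zscore

# Rows = customers; columns mimic lastMonthLoginNum, avgDateDiff, sleepDays.
X = np.array([
    [20.0,  2.0,  1.0],   # active
    [18.0,  3.0,  2.0],   # active
    [ 1.0, 40.0, 60.0],   # sleeping
    [ 2.0, 35.0, 55.0],   # sleeping
])

Xs = zscore(X, axis=0)                      # scale each feature
Z = linkage(Xs, method="ward")              # hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")
```

Scaling matters here because ltv and login counts live on very different numeric ranges; without it, the largest-magnitude feature would dominate the distance.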